LightScan: Faster Scan Primitive on CUDA Compatible Manycore Processors
Abstract
Scan (or prefix sum) is a fundamental and widely used primitive in parallel computing. In this paper, we present LightScan, a faster parallel scan primitive for CUDA-enabled GPUs, which uses a hybrid model combining intra-block computation and inter-block communication to perform a scan. Our algorithm employs warp shuffle functions to implement fast intra-block computation and takes advantage of the globally coherent L2 cache and the associated parallel thread execution (PTX) assembly instructions to realize lightweight inter-block communication. Performance evaluation using a single Tesla K40c GPU shows that LightScan outperforms existing GPU algorithms and implementations, yielding speedups of up to 2.1, 2.4, 1.5 and 1.2 over the leading CUDPP, Thrust, ModernGPU and CUB implementations running on the same GPU, respectively. Furthermore, LightScan runs up to 8.9 and 257.3 times faster than Intel TBB running on 16 CPU cores and an Intel Xeon Phi 5110P coprocessor, respectively. Source code of LightScan is available at http://cupbb.sourceforge.net.
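The hybrid model described in the abstract (a scan inside each block, plus a value passed between blocks) can be illustrated with a small CPU-side sketch. This is not LightScan's implementation: the function names `chunk_inclusive_scan` and `blocked_inclusive_scan` and the chunk size of 32 are assumptions made for illustration, the shuffle-style step is emulated with a plain array copy instead of CUDA warp shuffle functions, and the inter-block communication is reduced to a sequential carry instead of L2 cache and PTX instructions.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Inclusive scan of data[begin, end) using the doubling-offset pattern that a
// warp shuffle-up scan follows. On a GPU each lane would read a neighbour's
// register; here (assumption) a snapshot copy stands in for that exchange.
void chunk_inclusive_scan(std::vector<int>& data, std::size_t begin, std::size_t end) {
    const std::size_t n = end - begin;
    for (std::size_t offset = 1; offset < n; offset *= 2) {
        // Snapshot the values as they were before this step.
        std::vector<int> prev(data.begin() + static_cast<std::ptrdiff_t>(begin),
                              data.begin() + static_cast<std::ptrdiff_t>(end));
        for (std::size_t lane = offset; lane < n; ++lane) {
            data[begin + lane] = prev[lane] + prev[lane - offset];
        }
    }
}

// Hypothetical two-level scan: scan each chunk independently, then add the
// running total of all earlier chunks (the "inter-block" carry).
std::vector<int> blocked_inclusive_scan(const std::vector<int>& input, std::size_t chunk = 32) {
    assert(chunk > 0 && "chunk size must be positive");
    std::vector<int> out(input);
    int carry = 0;  // sum of every element in the preceding chunks
    for (std::size_t begin = 0; begin < out.size(); begin += chunk) {
        const std::size_t end = std::min(begin + chunk, out.size());
        chunk_inclusive_scan(out, begin, end);
        for (std::size_t i = begin; i < end; ++i) {
            out[i] += carry;
        }
        carry = out[end - 1];
    }
    return out;
}
```

In this sketch the carry is propagated serially from one chunk to the next; on a GPU the blocks run concurrently, which is why a lightweight way of passing the carry between blocks matters for performance.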
Similar resources
SPAP: A Programming Language for Heterogeneous Many-Core Systems
We present SPAP (Same Program for All Processors), a container-based programming language for heterogeneous many-core systems. SPAP abstracts away processor-specific concurrency and performance concerns using containers. Each SPAP container is a high level primitive with an STL-like interface. The programmer-visible behavior of the container is consistent with its sequential counterpart, which en...
The nonequispaced FFT on graphics processing units
Without doubt, the fast Fourier transform (FFT) belongs to the algorithms with large impact on science and engineering. By appropriate approximations, this scheme has been generalized for arbitrary spatial sampling points. This so-called nonequispaced FFT is the core of the sequential NFFT3 library and we discuss its computational costs in detail. On the other hand, programmable graphics proces...
Sorting using BItonic netwoRk wIth CUDA
Novel “manycore” architectures, such as graphics processors, are highly parallel, high-performance shared-memory architectures [7] born to solve specific problems such as graphical ones. Those architectures can be exploited to solve a wider range of problems by designing the related algorithms for such architectures. We present a fast sorting algorithm implementing an efficient bitonic sort...
Spatial Scan Statistics on the GPGPU
Kulldorff’s spatial scan statistic and the software implementation (SaTScan) are widely used for the detection and evaluation of geographic clusters, particularly within the health care community. Unfortunately, the computational time of the scan statistic depends on a wide variety of variables, and, depending on the chosen parameter settings and operations, the computational time can be on the...
Energy Introspector: Simulation Infrastructure for Power, Temperature, and Reliability Modeling in Manycore Processors
This paper presents an architecture-independent modeling infrastructure called the Energy Introspector for estimating non-functional aspects of processors such as energy, power, temperature, area, delay, sensor, and reliability. The Energy Introspector supports processor modeling through the integration of various modeling tools. It features structural abstraction of physical and microarchitectu...
Journal title:
- CoRR
Volume: abs/1604.04815
Pages: -
Publication date: 2016